This notebook provides an analysis on retail and warehouse sales. The data contains 307,645 sales records and 9 columns, including retail sales, warehouse sales, name of suppliers, and item descriptions.
The cleaning and formatting process includes:
The analysis answers a variety of business questions such as:
Most of the data visualizations are done using Plotly express because of their interactive capabilities. Seaborn is also used to create heatmaps.
import pandas as pd
import os
import seaborn as sns
pwd = os.getcwd()
sales_data = pd.read_csv(pwd + "\\Warehouse_and_Retail_Sales.csv")
shape = sales_data.shape
print(f"There are {shape[0]} rows and {shape[1]} columns.")
print(f"{shape[0]*shape[1]} total elements.")
There are 307645 rows and 9 columns. 2768805 total elements.
list(sales_data.columns)
['YEAR', 'MONTH', 'SUPPLIER', 'ITEM CODE', 'ITEM DESCRIPTION', 'ITEM TYPE', 'RETAIL SALES', 'RETAIL TRANSFERS', 'WAREHOUSE SALES']
sales_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307645 entries, 0 to 307644 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 YEAR 307645 non-null int64 1 MONTH 307645 non-null int64 2 SUPPLIER 307478 non-null object 3 ITEM CODE 307645 non-null object 4 ITEM DESCRIPTION 307645 non-null object 5 ITEM TYPE 307644 non-null object 6 RETAIL SALES 307642 non-null float64 7 RETAIL TRANSFERS 307645 non-null float64 8 WAREHOUSE SALES 307645 non-null float64 dtypes: float64(3), int64(2), object(4) memory usage: 21.1+ MB
sales_data.isnull().sum()
YEAR 0 MONTH 0 SUPPLIER 167 ITEM CODE 0 ITEM DESCRIPTION 0 ITEM TYPE 1 RETAIL SALES 3 RETAIL TRANSFERS 0 WAREHOUSE SALES 0 dtype: int64
print(f"There is {round(sales_data.isnull().sum().sum()/2768805,4)}% missing data in this dataset.")
There is 0.0001% missing data in this dataset.
supplier_nan = pd.isnull(sales_data["SUPPLIER"])
sales_data[supplier_nan]
| YEAR | MONTH | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | RETAIL TRANSFERS | WAREHOUSE SALES | |
|---|---|---|---|---|---|---|---|---|---|
| 107 | 2020 | 1 | NaN | 107 | JIGGER MEASURE SHOT GLASS | STR_SUPPLIES | 14.69 | 18.0 | 0.0 |
| 189 | 2020 | 1 | NaN | 113 | BARTENDERS BLACK BOOK | STR_SUPPLIES | 0.40 | 0.0 | 0.0 |
| 233 | 2020 | 1 | NaN | 115 | PLASTIC SHOT GLASS PACK | STR_SUPPLIES | 5.71 | 6.0 | 0.0 |
| 254 | 2020 | 1 | NaN | 117 | WHISKEY TASTING JOURNAL | STR_SUPPLIES | 0.08 | 0.0 | 0.0 |
| 263 | 2020 | 1 | NaN | 118 | PLASTIC WINE GLASS PACK | STR_SUPPLIES | 7.40 | 10.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307414 | 2020 | 9 | NaN | 3 | COUPON | NON-ALCOHOL | NaN | 0.0 | 0.0 |
| 307443 | 2020 | 9 | NaN | 4 | RMS ITEM | NON-ALCOHOL | NaN | 0.0 | 0.0 |
| 307493 | 2020 | 9 | NaN | 107 | JIGGER MEASURE SHOT GLASS | STR_SUPPLIES | 8.62 | 16.0 | 0.0 |
| 307601 | 2020 | 9 | NaN | WC | WINE CREDIT | REF | 0.00 | 0.0 | -70.0 |
| 307605 | 2020 | 9 | NaN | 115 | PLASTIC SHOT GLASS PACK | STR_SUPPLIES | 4.21 | 6.0 | 0.0 |
167 rows × 9 columns
item_type_nan = pd.isnull(sales_data["ITEM TYPE"])
sales_data[item_type_nan]
| YEAR | MONTH | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | RETAIL TRANSFERS | WAREHOUSE SALES | |
|---|---|---|---|---|---|---|---|---|---|
| 106749 | 2017 | 10 | REPUBLIC NATIONAL DISTRIBUTING CO | 347939 | FONTANAFREDDA BAROLO SILVER LABEL 750 ML | NaN | 0.0 | 0.0 | 1.0 |
retail_nan = pd.isnull(sales_data["RETAIL SALES"])
sales_data[retail_nan]
| YEAR | MONTH | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | RETAIL TRANSFERS | WAREHOUSE SALES | |
|---|---|---|---|---|---|---|---|---|---|
| 18396 | 2020 | 7 | NaN | 4 | RMS ITEM | NON-ALCOHOL | NaN | 0.0 | 0.0 |
| 307414 | 2020 | 9 | NaN | 3 | COUPON | NON-ALCOHOL | NaN | 0.0 | 0.0 |
| 307443 | 2020 | 9 | NaN | 4 | RMS ITEM | NON-ALCOHOL | NaN | 0.0 | 0.0 |
sales_data["SUPPLIER"].fillna(sales_data["SUPPLIER"].mode().iloc[0], inplace=True)
sales_data["ITEM TYPE"].fillna(sales_data["ITEM TYPE"].mode().iloc[0], inplace=True)
sales_data["RETAIL SALES"].fillna(sales_data["RETAIL SALES"].mean(), inplace=True)
sales_data.isnull().sum()
YEAR 0 MONTH 0 SUPPLIER 0 ITEM CODE 0 ITEM DESCRIPTION 0 ITEM TYPE 0 RETAIL SALES 0 RETAIL TRANSFERS 0 WAREHOUSE SALES 0 dtype: int64
sales_data[supplier_nan]
| YEAR | MONTH | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | RETAIL TRANSFERS | WAREHOUSE SALES | |
|---|---|---|---|---|---|---|---|---|---|
| 107 | 2020 | 1 | REPUBLIC NATIONAL DISTRIBUTING CO | 107 | JIGGER MEASURE SHOT GLASS | STR_SUPPLIES | 14.690000 | 18.0 | 0.0 |
| 189 | 2020 | 1 | REPUBLIC NATIONAL DISTRIBUTING CO | 113 | BARTENDERS BLACK BOOK | STR_SUPPLIES | 0.400000 | 0.0 | 0.0 |
| 233 | 2020 | 1 | REPUBLIC NATIONAL DISTRIBUTING CO | 115 | PLASTIC SHOT GLASS PACK | STR_SUPPLIES | 5.710000 | 6.0 | 0.0 |
| 254 | 2020 | 1 | REPUBLIC NATIONAL DISTRIBUTING CO | 117 | WHISKEY TASTING JOURNAL | STR_SUPPLIES | 0.080000 | 0.0 | 0.0 |
| 263 | 2020 | 1 | REPUBLIC NATIONAL DISTRIBUTING CO | 118 | PLASTIC WINE GLASS PACK | STR_SUPPLIES | 7.400000 | 10.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307414 | 2020 | 9 | REPUBLIC NATIONAL DISTRIBUTING CO | 3 | COUPON | NON-ALCOHOL | 7.024071 | 0.0 | 0.0 |
| 307443 | 2020 | 9 | REPUBLIC NATIONAL DISTRIBUTING CO | 4 | RMS ITEM | NON-ALCOHOL | 7.024071 | 0.0 | 0.0 |
| 307493 | 2020 | 9 | REPUBLIC NATIONAL DISTRIBUTING CO | 107 | JIGGER MEASURE SHOT GLASS | STR_SUPPLIES | 8.620000 | 16.0 | 0.0 |
| 307601 | 2020 | 9 | REPUBLIC NATIONAL DISTRIBUTING CO | WC | WINE CREDIT | REF | 0.000000 | 0.0 | -70.0 |
| 307605 | 2020 | 9 | REPUBLIC NATIONAL DISTRIBUTING CO | 115 | PLASTIC SHOT GLASS PACK | STR_SUPPLIES | 4.210000 | 6.0 | 0.0 |
167 rows × 9 columns
sales_data[retail_nan]
| YEAR | MONTH | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | RETAIL TRANSFERS | WAREHOUSE SALES | |
|---|---|---|---|---|---|---|---|---|---|
| 18396 | 2020 | 7 | REPUBLIC NATIONAL DISTRIBUTING CO | 4 | RMS ITEM | NON-ALCOHOL | 7.024071 | 0.0 | 0.0 |
| 307414 | 2020 | 9 | REPUBLIC NATIONAL DISTRIBUTING CO | 3 | COUPON | NON-ALCOHOL | 7.024071 | 0.0 | 0.0 |
| 307443 | 2020 | 9 | REPUBLIC NATIONAL DISTRIBUTING CO | 4 | RMS ITEM | NON-ALCOHOL | 7.024071 | 0.0 | 0.0 |
sales_data[item_type_nan]
| YEAR | MONTH | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | RETAIL TRANSFERS | WAREHOUSE SALES | |
|---|---|---|---|---|---|---|---|---|---|
| 106749 | 2017 | 10 | REPUBLIC NATIONAL DISTRIBUTING CO | 347939 | FONTANAFREDDA BAROLO SILVER LABEL 750 ML | WINE | 0.0 | 0.0 | 1.0 |
sales_data[sales_data.duplicated()]
| YEAR | MONTH | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | RETAIL TRANSFERS | WAREHOUSE SALES |
|---|
sns.heatmap(sales_data.corr(),annot = True)
C:\Users\dchap\AppData\Local\Temp\ipykernel_24356\1292874986.py:1: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning. sns.heatmap(sales_data.corr(),annot = True)
<AxesSubplot: >
sales_data.drop(columns= "RETAIL TRANSFERS", inplace = True)
sales_data.head()
| YEAR | MONTH | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | WAREHOUSE SALES | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | 1 | REPUBLIC NATIONAL DISTRIBUTING CO | 100009 | BOOTLEG RED - 750ML | WINE | 0.00 | 2.0 |
| 1 | 2020 | 1 | PWSWN INC | 100024 | MOMENT DE PLAISIR - 750ML | WINE | 0.00 | 4.0 |
| 2 | 2020 | 1 | RELIABLE CHURCHILL LLLP | 1001 | S SMITH ORGANIC PEAR CIDER - 18.7OZ | BEER | 0.00 | 1.0 |
| 3 | 2020 | 1 | LANTERNA DISTRIBUTORS INC | 100145 | SCHLINK HAUS KABINETT - 750ML | WINE | 0.00 | 1.0 |
| 4 | 2020 | 1 | DIONYSOS IMPORTS INC | 100293 | SANTORINI GAVALA WHITE - 750ML | WINE | 0.82 | 0.0 |
#Format the new date in new column
sales_data["DATE"] = pd.to_datetime(sales_data["MONTH"].astype(str) + ' ' + sales_data["YEAR"].astype(str), format='%m %Y')
#Drop old date columns
sales_data.drop(columns = ["YEAR", "MONTH"], inplace = True)
#Reorder columns
sales_data = sales_data.loc[:,["DATE", "SUPPLIER", "ITEM CODE", "ITEM DESCRIPTION", "ITEM TYPE", "RETAIL SALES", "WAREHOUSE SALES"]]
sales_data.head()
| DATE | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | WAREHOUSE SALES | |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | REPUBLIC NATIONAL DISTRIBUTING CO | 100009 | BOOTLEG RED - 750ML | WINE | 0.00 | 2.0 |
| 1 | 2020-01-01 | PWSWN INC | 100024 | MOMENT DE PLAISIR - 750ML | WINE | 0.00 | 4.0 |
| 2 | 2020-01-01 | RELIABLE CHURCHILL LLLP | 1001 | S SMITH ORGANIC PEAR CIDER - 18.7OZ | BEER | 0.00 | 1.0 |
| 3 | 2020-01-01 | LANTERNA DISTRIBUTORS INC | 100145 | SCHLINK HAUS KABINETT - 750ML | WINE | 0.00 | 1.0 |
| 4 | 2020-01-01 | DIONYSOS IMPORTS INC | 100293 | SANTORINI GAVALA WHITE - 750ML | WINE | 0.82 | 0.0 |
sd_sorted_date = sales_data.sort_values("DATE")
sales_data = sd_sorted_date
sales_data.head()
| DATE | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | WAREHOUSE SALES | |
|---|---|---|---|---|---|---|---|
| 54813 | 2017-06-01 | VINTUS LLC | 43919 | QUINTA DO NOVAL TAWNY 10YR - 750ML | WINE | 0.32 | 0.0 |
| 51848 | 2017-06-01 | DANGEROUS WINE GROUP LLC | 335314 | PALISSADE ROSE - 750ML | WINE | 0.24 | 0.0 |
| 51847 | 2017-06-01 | PWSWN INC | 335309 | MARC CELLARS CAB 12 - 750ML | WINE | 0.00 | 3.0 |
| 51846 | 2017-06-01 | PWSWN INC | 335289 | PASO ROBLES PETITE COTE - 750ML | WINE | 0.00 | 6.0 |
| 51845 | 2017-06-01 | A VINTNERS SELECTIONS | 335285 | SATELLITE S/BLC - 750ML | WINE | 0.00 | 6.0 |
import plotly.express as px
fig1 = px.histogram(sales_data, x="SUPPLIER").update_xaxes(categoryorder='total descending')
fig1.show()